Data Collection Methods

DATAX121-23A (HAM) & (SEC) - Introduction to Statistical Methods

Learning Outcomes

  • Understand the purpose of random sampling
  • Identify randomised experiments and observational studies
  • Explain the differences between randomised experiments and observational studies when studying a “treatment”

Why bother planning how you collect data?

Garbage in, garbage out

— Virtually all users of data

A very “recent” example is the 2018 Census carried out by Stats NZ. If you are interested in how Stats NZ addressed these concerns

To ensure that the data you collect…

  • Can answer your (research) question
  • Can be analysed with a (statistical) model
  • Is the correct data

Definition: Bias

On average, how “far away” our estimate is from the “truth”

  • For example, is the sample mean gross weekly income in CS 1.1 close to the (unknown) population mean gross weekly income in the June quarter of 2011?

There can be multiple reasons for any biases in our estimates of the “truth”, for example:

  • The way we calculated the estimate
  • Our sample data is not representative of the population
  • The way we measured the values

Definition: Precision

On average, how variable is the estimate of the “truth”

  • For example, consider a science experiment where you measure the velocity of a falling object

Like bias, there can be multiple reasons for imprecise estimates of the “truth”, for example:

  • Not enough sample data
  • The way we measured the values
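
As a rough illustration of the two definitions, the sketch below simulates repeated estimates from two made-up measuring processes. The “truth” of 10, the offset of 1 and the two spreads are assumed values chosen only for this illustration; they do not come from any of the case studies.

```r
# A sketch with made-up values: repeated estimates of a known "truth"
set.seed(121)
truth <- 10  # assumed "truth" for the illustration

# Process A: centred on the truth, but highly variable
est.a <- rnorm(1000, mean = truth, sd = 2)
# Process B: consistently off by 1, but not very variable
est.b <- rnorm(1000, mean = truth + 1, sd = 0.2)

# Bias: on average, how far the estimates are from the truth
bias <- c(A = mean(est.a) - truth, B = mean(est.b) - truth)
# Precision: how variable the estimates are (smaller spread = more precise)
spread <- c(A = sd(est.a), B = sd(est.b))

round(bias, 2)
round(spread, 2)
```

Under these assumptions, process A should show little bias but poor precision, while process B should be precise but biased.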

Samples from populations

A population includes all individuals or objects of interest. In this context, data are typically collected from a sample, which is a subset of the population.

Why do we sample?

It can be difficult to measure and categorise variables for the target population

  • New Zealand's estimated population is 5,138,154 people as at Thursday, 2nd March
  • Gibbons spend most of their time in the jungle canopy

Nevertheless, a census of the target population means that anything calculated from the collected variables is the ground “truth”

A well-designed sample of the target population means that we get an accurate estimate of the ground “truth”

NZ political opinion polls

Source: www.andrewchen.nz/polls

Population ⇝ Sample ⇢ Population

Parameters and Statistics

The ground “truth” is known as a (population) parameter. For example:

  • The population mean, \(\mu\)
  • The population standard deviation, \(\sigma\)

The estimate of a parameter, based on our sample data, is known as a statistic

  • The sample mean, \(\bar{x}\)
  • The sample standard deviation, \(s\)
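
To make the distinction concrete, here is a minimal sketch that treats the integers 1 to 100 as a made-up “population”, so that the parameters can actually be computed and compared with statistics from one sample. The population and the sample size of 10 are assumptions for illustration only.

```r
# A made-up "population" of 100 values, so the parameters are known
population <- 1:100
mu <- mean(population)
sigma <- sqrt(mean((population - mu)^2))  # population SD divides by N

# One simple random sample of size 10 and its statistics
set.seed(121)
x <- sample(population, size = 10)
x.bar <- mean(x)
s <- sd(x)  # sd() divides by n - 1, i.e. the sample SD

c(mu = mu, sigma = sigma, x.bar = x.bar, s = s)
```

In practice the parameters are unknown, which is why we rely on the statistics.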

Definition: Simple random sample

All potential observations have the same chance to be selected for the sample

Why should we take random samples from the target population?

  • Minimises potential biases due to selection
  • Ensures that the selection of each observation is independent of another
# Randomly select ten integers (numbers) between 1 and 100, 
#   without replacement
sample(1:100, size = 10)
 [1] 100  10   5  89  42  12   2  69  51  87

Sampling error

The survey literature makes a distinction between sampling errors, which arise from the decision to take a sample rather than trying to survey the whole population (which is what a census tries to do)

— Wild & Seber (2000)

Sometimes—by chance—we can select a “bad” simple random sample that may not be representative of the population

# Randomly select ten integers (numbers) between 1 and 100, 
#   without replacement
sample(1:100, size = 10)
 [1]  6  9  1  7  3  5  4 10  2  8

\[ (1/100)^{10} = 1 \times 10^{-20} \]
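
For reference, the chance of drawing one particular sequence can be checked numerically. The value \((1/100)^{10}\) corresponds to draws made with replacement; without replacement, as in the sample() call, the chance of one particular ordered sequence is \(1/(100 \times 99 \times \dots \times 91)\), which is of the same order of magnitude. A short sketch of these calculations (the object names are made up):

```r
# Chance of one particular ordered sequence, drawing WITH replacement
p.with <- (1/100)^10

# Chance of one particular ordered sequence, drawing WITHOUT replacement
p.without.ordered <- 1 / prod(100:91)

# Chance of one particular set of ten numbers (order ignored),
#   drawing without replacement
p.without.set <- 1 / choose(100, 10)

c(with = p.with, without.ordered = p.without.ordered,
  without.set = p.without.set)
```

Whichever version is used, a sample made up of exactly the integers 1 to 10 is extremely unlikely, but not impossible.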

Sampling error cont’d

In light of Slide 12, we can demonstrate that—on average—a statistic calculated from a simple random sample is an unbiased estimate of the (unknown) parameter

Why? Simple random samples provide a neat “predictable” property, even though randomness is involved

More on this once we start the Introduction to Statistical Inference topic
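
In the meantime, a small simulation can hint at this property. The sketch below assumes a made-up population (the integers 1 to 100) and repeatedly takes simple random samples of size 10; the number of repetitions, 10000, is an arbitrary choice.

```r
# A made-up population with a known mean
population <- 1:100
mu <- mean(population)

# Take many simple random samples and record each sample mean
set.seed(121)
sample.means <- replicate(10000, mean(sample(population, size = 10)))

# Individual sample means vary a lot...
range(sample.means)
# ...but on average they sit very close to the population mean
mean(sample.means) - mu
```

Any single sample mean can be far from \(\mu\), yet the average of the sample means should be close to it.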

Definition: Stratified sample

All potential observations are first split into distinct groups (strata). Then we take a simple random sample of all potential observations within each group (stratum)

Why are stratified samples useful?

  • The sampling strategy ensures that each stratum has “similar” observations
  • Ensures that the selection of one observation is independent of another within and between strata
# Randomly select ten integers (numbers) between 1 and 10, 
#   with replacement
sample(1:10, size = 10, replace = TRUE)
 [1] 10  2  2  6  9  1  1 10  3  4
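
The definition above can be sketched in R by splitting a sampling frame into strata and taking a simple random sample within each. The data frame, the four strata and the 10% sampling fraction below are all made up for illustration.

```r
# A made-up sampling frame of 100 units in four strata of unequal size
frame.df <- data.frame(id = 1:100,
                       stratum = rep(c("A", "B", "C", "D"),
                                     times = c(40, 30, 20, 10)))

# Take a simple random sample of 10% of the units within each stratum
set.seed(121)
strata <- split(frame.df, ~ stratum)
sampled <- lapply(strata, \(x) x[sample(nrow(x), size = nrow(x) / 10), ])
strat.sample <- do.call(rbind, sampled)

# Each stratum is represented in proportion to its size
xtabs( ~ stratum, data = strat.sample)
```

Sampling the same fraction from every stratum is only one possible design choice; other allocations are also used in practice.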

CS 1.1 revisited: NZ income snapshot in 2011

Recall the following piece of context:

The survey was an annual snapshot to produce income statistics on New Zealanders aged 15 and over based on a representative sample of the population.

One strategy to get a representative sample of the population is to conduct a stratified sample. Why?

nzis.df <- read.csv("datasets/NZIS-CART-SURF-2011.csv")
xtabs( ~ region, data = nzis.df) |>
  proportions() |>
  round(2)
region
    Auckland          BoP Christchurch     Gisborne     Manawatu       Nelson 
        0.31         0.06         0.15         0.05         0.05         0.04 
   Northland        Otago    Southland     Taranaki      Waikato   Wellington 
        0.04         0.05         0.02         0.02         0.09         0.12 

Here are the population proportions calculated for 2006

Region
    Auckland          BoP Christchurch     Gisborne     Manawatu       Nelson 
        0.33         0.06         0.13         0.05         0.05         0.04 
   Northland        Otago    Southland     Taranaki      Waikato   Wellington 
        0.04         0.05         0.02         0.03         0.09         0.11 
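
One way to compare the two sets of proportions is to subtract them. The sketch below re-types the rounded values printed above, so the differences are only indicative; the object names are made up.

```r
regions <- c("Auckland", "BoP", "Christchurch", "Gisborne", "Manawatu",
             "Nelson", "Northland", "Otago", "Southland", "Taranaki",
             "Waikato", "Wellington")
# Rounded sample proportions from the 2011 survey data
sample.prop <- c(0.31, 0.06, 0.15, 0.05, 0.05, 0.04,
                 0.04, 0.05, 0.02, 0.02, 0.09, 0.12)
# Rounded population proportions for 2006
pop.prop <- c(0.33, 0.06, 0.13, 0.05, 0.05, 0.04,
              0.04, 0.05, 0.02, 0.03, 0.09, 0.11)

# Sample proportion minus population proportion, by region
prop.diff <- setNames(round(sample.prop - pop.prop, 2), regions)
prop.diff

# The largest absolute difference across regions
max(abs(prop.diff))
```

No region differs by more than 0.02, which is consistent with the sample being roughly representative by region.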

Extensions to T011: Side-by-side plots by region

bwplot(region ~ income, data = nzis.df, pch = "|",
       xlab = "Gross Weekly Income ($)", ylab = "Region",
       main = "NZer's gross weekly income snapshot in 2011 by region")

Figure: The gross weekly income of 29447 New Zealanders by region

The variable is plotted side-by-side for each level of the categorical variable. The goal of side-by-side plots is to compare and contrast the plotted variable between levels

With the lattice R package, we can make side-by-side plots of:

  • Dot plots, stripplot()
  • Box plots, bwplot()
  • Bar plots, barchart() (and with colour)

Extensions to T011: Panel plots by qualification

histogram( ~ income | qualification, data = nzis.df, nint = 25,
          xlab = "Gross Weekly Income ($)",
          main = "NZer's gross weekly income snapshot in 2011 by qualification")

Figure: The gross weekly income of 29447 New Zealanders by qualification

The variable is plotted for each level of the categorical variable in its own panel. The goal of panel plots is to also compare and contrast the plotted variable between levels

With the lattice R package, we can make panel plots of:

  • Dot plots, stripplot()
  • Box plots, bwplot()
  • Histograms, histogram()
  • Scatter plots, xyplot()
  • Bar plots, barchart()

Extensions to T011: Introducing colour by sex

stripplot( ~ income, groups = sex, data = nzis.df, jitter.data = TRUE,
          factor = 10, xlab = "Gross Weekly Income ($)",
          main = "NZer's gross weekly income snapshot in 2011 by sex",
          auto.key = list(title = "Sex", space = "right"))

Figure: The gross weekly income of 29447 New Zealanders by sex

The variable is plotted and the values are colour-coded by the levels of the categorical variable. The goal of colour is to help distinguish between levels

With the lattice R package, we can make use of colour in:

  • Dot plots, stripplot()
  • Scatter plots, xyplot()
  • Bar plots, barchart() (and with side-by-side bars)

Extensions to T011: Descriptive statistics by region

Producing descriptive statistic(s) of a numeric variable for each level of a categorical variable requires additional R code:

# The standard set of summary statistics of income by region
split(nzis.df, ~ region) |>
  lapply(\(x) summary(x$income))
$Auckland
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-5100.0   243.0   542.0   720.2   989.0 25443.0 

$BoP
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-1688.0   197.0   470.0   619.7   901.0 11603.0 

$Christchurch
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-5100.0   246.0   545.5   701.0   970.0 16174.0 

$Gisborne
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 -518.0   243.2   521.5   692.5   924.8 21104.0 

$Manawatu
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-2379.0   234.0   559.0   655.7   948.2 10081.0 

$Nelson
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-4420.0   232.0   518.0   679.5   941.5 17538.0 

$Northland
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-3551.0   247.5   540.0   666.7   938.0  6967.0 

$Otago
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-3551.0   256.0   570.0   685.6   970.0  8439.0 

$Southland
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-1308.0   200.0   491.0   647.8   994.0  4631.0 

$Taranaki
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 -413.0   223.5   469.0   634.1   884.5  5866.0 

$Waikato
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-2110.0   223.0   518.5   647.2   928.8 14782.0 

$Wellington
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-3551.0   254.5   574.0   729.5  1001.0 18369.0 
# The sample standard deviation of income by region
split(nzis.df, ~ region) |> 
  lapply(\(x) sd(x$income))
$Auckland
[1] 885.168

$BoP
[1] 682.2891

$Christchurch
[1] 810.055

$Gisborne
[1] 959.4897

$Manawatu
[1] 646.0911

$Nelson
[1] 850.093

$Northland
[1] 711.9051

$Otago
[1] 670.9319

$Southland
[1] 630.7546

$Taranaki
[1] 658.1127

$Waikato
[1] 681.1972

$Wellington
[1] 840.5936

Random sampling solves it all?

We have so far introduced concepts relating to taking samples from the population. Recall the following point raised in Slide 4

Our sample data is not representative of the population

In practice, taking random samples from the correct population to answer a research question is the hardest part!

  • Target Population
    The population of observations we wish to sample
  • Sampling Frame
    The population of potential observations we can sample
  • Sampled Population
    The “intersection” of the target population and sampling frame
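
These three definitions can be pictured with sets. The sketch below uses made-up ID numbers to stand in for observations; nothing here comes from a real sampling frame.

```r
# Hypothetical IDs of the observations we wish to sample
target.pop <- 1:10
# Hypothetical IDs of the observations we can actually sample
sampling.frame <- 4:13

# The sampled population is where the two overlap
sampled.pop <- intersect(target.pop, sampling.frame)
sampled.pop

# Members of the target population that can never be selected
missed <- setdiff(target.pop, sampling.frame)
missed
```

If the missed observations differ systematically from the rest, estimates based on the sampled population can be biased.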

Briefly on non-sampling errors

What we discussed on the previous slide is known as Selection Bias, which arises when the target population and sampling frame do not fully overlap

Some other common issues for a variety of fields are:

  • Measurement Error
    Values are not measured or categorised properly due to a “poor definition”
  • Nonresponse Bias
    In a survey or opinion poll setting, the sampled observations do not complete a portion of the survey or poll

Reference: Other sampling methods

Cluster sampling
All potential observations have a chance to be selected for the sample. However, the researcher randomly selects whole groups (clusters) of units rather than individual units
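
A minimal sketch of cluster sampling, assuming a made-up sampling frame of 20 clusters with 5 units each, and assuming that the clusters are chosen at random and every unit in a chosen cluster is kept:

```r
# A made-up sampling frame: 100 units grouped into 20 clusters of 5
frame.df <- data.frame(unit = 1:100, cluster = rep(1:20, each = 5))

# Randomly select 4 whole clusters
set.seed(121)
chosen <- sample(unique(frame.df$cluster), size = 4)

# Keep every unit that belongs to a chosen cluster
cluster.sample <- subset(frame.df, cluster %in% chosen)
xtabs( ~ cluster, data = cluster.sample)
```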

Systematic sampling
All potential observations have a chance to be selected for the sample by

  1. Choosing a random starting point
  2. Selecting every \(i\)th unit from the sampling frame
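
The two steps can be sketched as follows, assuming a made-up sampling frame of N = 100 units and i = 10, with the random starting point chosen among the first i units:

```r
N <- 100  # assumed size of the sampling frame
i <- 10   # assumed sampling interval

# Step 1: choose a random starting point among the first i units
set.seed(121)
start <- sample(1:i, size = 1)

# Step 2: select every i-th unit from the starting point onwards
systematic.sample <- seq(from = start, to = N, by = i)
systematic.sample
```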

Self-selected sampling
All potential observations choose whether they are selected for the sample

Choice sampling (Judgement sampling)
The researcher(s) choose which potential observations are selected for the sample

Randomised experiments

A randomised experiment is a study in which the researcher actively controls one or more explanatory variables. Additionally, the values of the explanatory variable(s) are randomly assigned to the units before the response variable is measured.

Statistical principles of experimental design

Ronald Fisher identified three important principles that should be considered when designing an experiment

Replication to judge if the observed differences in the experimental data are due to a “signal” rather than “noise”. Practical constraints often limit the number of replicates

Randomisation of the replication order, and of the explanatory variable values that we hypothesise cause a “signal”, to justify treating each observation as independent of the others. Furthermore, randomisation ensures that each explanatory variable value has the same chance of being assigned to “good” or “bad” observations

Incorporate blocks, where possible, to ensure that the observed differences in the experimental data are calculated from groups of similar observations.
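
As a sketch of replication and randomisation together, the code below randomly assigns two made-up treatment values to 12 hypothetical experimental units (6 replicates each) and then randomises the order in which the units are run. The unit count and treatment labels are assumptions for illustration.

```r
set.seed(121)

# 12 hypothetical units; each treatment value is replicated 6 times
# sample() shuffles the labels, so assignment to units is random
design.df <- data.frame(
  unit = 1:12,
  treatment = sample(rep(c("Control", "Treatment"), each = 6))
)

# Randomise the order in which the units are run
design.df$run.order <- sample(nrow(design.df))

design.df[order(design.df$run.order), ]
```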

Definition: Response and Explanatory Variables

Response variable (Outcome, Dependent)
A variable that we believe changes in value because of the explanatory variable

Explanatory variable (Treatment, Independent)
A variable that we use to understand how the response variable changes in value. In randomised experiments, we often control the values of the explanatory variable(s)

Association & Causation

Association
Two variables are associated if values of one variable tend to be related to the values of the other variable

Causation
Two variables are causally associated if changing the value of one variable influences the value of the other variable

What is the difference between association & causation?
Causation means that changes in the explanatory variable produce predictable changes in the response variable, but not the other way around

Definition: Confounding variable

A confounding variable (confounder) is a third variable that is associated with both the explanatory variable and response variable. A confounding variable can offer a plausible explanation for an association between two variables of interest.

— Lock et al. (2021)

The goal of randomised experiments is to “break” the link between any potential confounding variable(s) and the explanatory variable through random assignment of the explanatory variable
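
A small simulation can illustrate the idea. In the sketch below, all variables and effect sizes are made up, and the response is deliberately generated so that it depends only on the confounder, never on the explanatory variable.

```r
set.seed(121)
n <- 5000
confounder <- rnorm(n)

# Observational setting: the explanatory variable depends on the confounder
x.obs <- confounder + rnorm(n)
y.obs <- 2 * confounder + rnorm(n)  # the response ignores x.obs entirely
cor.obs <- cor(x.obs, y.obs)

# Randomised setting: the explanatory variable is assigned at random,
#   so it cannot be related to the confounder
x.rand <- rnorm(n)
y.rand <- 2 * confounder + rnorm(n)
cor.rand <- cor(x.rand, y.rand)

round(c(observational = cor.obs, randomised = cor.rand), 2)
```

Under these assumptions, the observational data should show a clear association even though there is no causal effect, while the association should be close to zero once the explanatory variable is randomly assigned.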

CS 2.1: Replication with light speeds

Simon Newcomb experimented with a new method of measuring the speed of light in 1882, which involved using two different mirrors placed approximately 3721.865 metres apart. The following data comes from 20 repeated measurements of the passage time for light to travel from one mirror to another and back again.

The theoretical passage time for the above distance was 24.8296 millionths of a second. If this new method is unbiased and precise, the experimental data should agree with the theoretical passage time.
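
The theoretical passage time can be checked with a short calculation. This sketch assumes the modern defined speed of light in a vacuum, 299,792,458 metres per second, and ignores any effect of air; the object names are made up.

```r
mirror.distance <- 3721.865  # metres between the two mirrors
speed.of.light <- 299792458  # metres per second (assumed: vacuum value)

# Light travels to the far mirror and back, so the path is twice the distance
path.length <- 2 * mirror.distance

# Time = distance / speed, converted from seconds to millionths of a second
theoretical.time <- path.length / speed.of.light * 1e6
round(theoretical.time, 4)
```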

Variables
pass.time A number denoting the passage time for light to travel from one mirror to another and back again (millionths of a second, μs)
lightspeed.df <- read.csv("datasets/lightspeed.csv")
nrow(lightspeed.df)
[1] 20
summary(lightspeed.df)
   pass.time    
 Min.   :24.82  
 1st Qu.:24.82  
 Median :24.83  
 Mean   :24.83  
 3rd Qu.:24.83  
 Max.   :24.84  
# Let's get summary() to print more significant figures
summary(lightspeed.df, digits = 6)
   pass.time      
 Min.   :24.8200  
 1st Qu.:24.8250  
 Median :24.8280  
 Mean   :24.8285  
 3rd Qu.:24.8320  
 Max.   :24.8370  

CS 2.1: Exploring pass.time

stripplot( ~ pass.time, data = lightspeed.df, jitter.data = TRUE,
          factor = 5, main = "Measurements of passage time for light from Newcomb's experiment",
          xlab = "Passage time (millionths of a second)")

Figure: The distribution of passage times for light to travel from one mirror to another and back again

Recall that the theoretical passage time for Newcomb’s experiment was 24.8296 millionths of a second.
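
As a rough check, the printed summary values can be compared with the theoretical passage time. The sketch below re-types the rounded values from the summary() output rather than using the raw data, so the results are only approximate.

```r
theoretical.time <- 24.8296  # millionths of a second
mean.time <- 24.8285         # sample mean, from summary()
min.time <- 24.8200          # sample minimum, from summary()
max.time <- 24.8370          # sample maximum, from summary()

# How far is the sample mean from the theoretical value?
mean.diff <- mean.time - theoretical.time
mean.diff

# How spread out are the 20 measurements?
max.time - min.time
```

The sample mean is about 0.0011 millionths of a second below the theoretical value, which is small relative to the spread of the measurements.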

CS 2.2: Fish respiration rates

A professor carried out an experiment to determine the best calcium level to ensure that fish have low respiration rates. The fish were randomly assigned to three tanks with different levels of calcium.

Variables
Calcium A factor denoting the calcium level of the tank, Low, Medium or High
GillRate A number denoting the respiration rate of the fish (gill beats per minute, gbpm)
respiration.df <- read.csv("datasets/fish-respiration.csv")
nrow(respiration.df)
[1] 90
split(respiration.df, ~ Calcium) |> 
  lapply(\(x) summary(x))
$High
   Calcium             GillRate    
 Length:30          Min.   :37.00  
 Class :character   1st Qu.:45.75  
 Mode  :character   Median :58.50  
                    Mean   :58.17  
                    3rd Qu.:68.00  
                    Max.   :85.00  

$Low
   Calcium             GillRate    
 Length:30          Min.   :44.00  
 Class :character   1st Qu.:55.50  
 Mode  :character   Median :65.00  
                    Mean   :68.50  
                    3rd Qu.:84.75  
                    Max.   :98.00  

$Medium
   Calcium             GillRate    
 Length:30          Min.   :33.00  
 Class :character   1st Qu.:46.00  
 Mode  :character   Median :59.50  
                    Mean   :58.67  
                    3rd Qu.:68.75  
                    Max.   :83.00  

CS 2.2: Exploring GillRate versus Calcium

bwplot(GillRate ~ Calcium, data = respiration.df, pch = "|",
       factor = 5, main = "Respiration rate versus the tank's calcium level",
       xlab = "Calcium level", ylab = "Respiration rate (gbpm)")

Figure: The distribution of fish respiration rates for each calcium level

Recall that we want to judge if the observed differences in the experimental data are due to a “signal” rather than “noise”

Briefly on blocking

In a block design, experimental units (observations) are first divided into homogeneous groups called blocks, and each treatment is randomly assigned to one or more units within each block

— Utts & Heckard (2015)

Strawberries
Consider designing an experiment to find out if the application of herbicides would harm the growth of strawberry plants. You have four kinds of herbicide, A–D, and one control treatment

Driving Impairment
Consider designing an experiment to determine if the following treatments can causally explain driving ability: alcohol, marijuana, or sober
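
For the strawberry example, one possible block design can be sketched as below. The number of blocks (4) and the idea that each block holds exactly five plots are assumptions for illustration; a block could be, say, a row of similar plots in a field.

```r
treatments <- c("A", "B", "C", "D", "Control")
n.blocks <- 4  # assumed number of blocks

# Within each block, randomly assign the five treatments to five plots
set.seed(121)
blocks <- lapply(1:n.blocks, \(b) {
  data.frame(block = b, plot = 1:5, treatment = sample(treatments))
})
design.df <- do.call(rbind, blocks)

# Every treatment appears exactly once in every block
xtabs( ~ block + treatment, data = design.df)
```

Because each treatment appears in every block, comparisons between treatments are made among similar plots.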

Observational studies

An observational study is a study in which the researcher does not actively control the explanatory variable(s). The researcher simply observes the values of the explanatory variable(s) as they naturally exist

The three broad types of observational studies

Retrospective Studies

The researchers sample observations to make inferences based on variables, some of which were measured or categorised previously

Cross-sectional Studies

The researchers sample observations to make inferences based on variables measured at a specific point in time

Prospective Studies (Longitudinal Studies)

The researchers sample observations to make inferences based on how variables change over time

CS 2.3: The impact of smoking

This dataset describes an observational study of youths who lived in East Boston, Massachusetts, USA, sometime during the 1970s. The researchers followed these youths for seven years, and their primary research question was whether smokers suffered from reduced lung capacity.

Variables
Age An integer denoting the age of a subject (in years)
Height_In A number denoting the height of a subject (in inches)
Sex A factor denoting the sex of the subject, male or female
Smoke A factor denoting the smoking status of the subject, non-smoker or smoker
LungCap A number denoting the lung capacity of the subject (unitless)
lung.df <- read.csv("datasets/lung-capacity.csv")
nrow(lung.df)
[1] 654
head(lung.df)
  Age Height_In    Sex      Smoke LungCap
1   9      57.0 Female Non-Smoker   1.708
2   8      67.5 Female Non-Smoker   1.724
3   7      54.5 Female Non-Smoker   1.720
4   9      53.0   Male Non-Smoker   1.558
5   9      57.0   Male Non-Smoker   1.895
6   8      61.0 Female Non-Smoker   2.336
summary(lung.df)
      Age           Height_In         Sex               Smoke          
 Min.   : 3.000   Min.   :46.00   Length:654         Length:654        
 1st Qu.: 8.000   1st Qu.:57.00   Class :character   Class :character  
 Median :10.000   Median :61.50   Mode  :character   Mode  :character  
 Mean   : 9.931   Mean   :61.14                                        
 3rd Qu.:12.000   3rd Qu.:65.50                                        
 Max.   :19.000   Max.   :74.00                                        
    LungCap     
 Min.   :0.791  
 1st Qu.:1.981  
 Median :2.547  
 Mean   :2.637  
 3rd Qu.:3.119  
 Max.   :5.793  

CS 2.3: Exploring LungCap by Smoke

histogram( ~ LungCap | Smoke, data = lung.df,
          main = "Lung capacity by Smoker status", nint = 10,
          xlab = "Lung capacity (unitless)")

Figure: The distribution of lung capacity for each smoker status level

Remember that the goal of panel plots is to compare and contrast

CS 2.3: Exploring LungCap vs. Age by Smoke

xyplot(LungCap ~ Age | Smoke, data = lung.df,
       main = "Lung capacity versus Age by Smoker status",
       xlab = "Age (years)", ylab = "Lung capacity (unitless)")

Figure: A scatter plot of lung capacity versus age for each smoker status level

Remember that the goal of panel plots is to compare and contrast

CS 2.3: Exploring LungCap vs. Age by Smoke & Sex

xyplot(LungCap ~ Age | Smoke + Sex, data = lung.df,
       main = "Lung capacity versus Age by Smoker status and Sex",
       xlab = "Age (years)", ylab = "Lung capacity (unitless)")

Figure: A scatter plot of lung capacity versus age for each pair of smoker status and sex levels

Remember that the goal of panel plots is to compare and contrast

Randomised experiments versus Observational studies

Random assignment of the explanatory variable values to observations versus (Random) sampling of observations whose explanatory variable values are simply observed.

Clinical trials versus Case-control studies

Clinical trials can be designed as a randomised experiment to quantify the effectiveness of a treatment, e.g. vaccination and no vaccination (placebo)

Case-control studies can be designed as a stratified sample to compare two groups, e.g. vaccinated and unvaccinated